In this post, I begin an investigation into the Motor Vehicle Collisions - Crashes dataset from NYC Open Data. The purpose of my inquiry is twofold. First, the Vision Zero initiative is an important effort by the city to reduce traffic deaths and I wanted to get acquainted with some of the data behind it. Second, just to be extra topical and relevant, I am interested to understand the relationship between traffic collisions and the Congestion Relief Zone introduced in early 2025. Through a few posts, my goal will be to explore these data and build some models to better understand the impact that the Congestion Relief Zone has on vehicle crashes (decreased due to lower car volume? increased due to less slow-moving traffic?).
This post will be focused on exploratory data analysis. I will endeavor to get a handle on the main crashes dataset, join it up with some various spatial attributes (borough, census tract, etc.), and create some visualizations to help me understand how to tackle the modeling component of the project. My rough idea is to use a version of this Besag York Mollié (BYM) Bayesian hierarchical model. In the linked paper, Morris et al. use the BYM model to model motor vehicle crashes involving school children at the census tract level. My thinking is that, if I can get their model to work, I could add a Difference-in-Difference component to capture changes in pre- and post-congestion pricing and get some sort of treatment effect. We’ll see how far I get in that but, for now, I present some exploratory data analysis along with my thoughts.
The GitHub repository for this project is available here. Also, note that you can expand code chunks if you wish using the button at the top-right of any visualizations.
In terms of data loading, I’m going to be working with the NYC Motor Vehicle Collisions - Crashes dataset from NYC Open Data and will also be loading (a) the Central Business District shape (i.e., the congestion pricing zone) and (b) NYC Census Tract data so I can aggregate the individual crashes to the census tract level. Links to the various data sources are below:
Below, you’ll find an interactive data table containing a random sample of 10,000 of the 116,097 collisions records left after some data cleaning (you can see the data cleaning steps in the expandable code chunk).
# Helper variables
cp_initial_date <- as.Date("2025-01-05", format = "%Y-%m-%d")
# Loading Central Business District Shape: https://data.ny.gov/Transportation/MTA-Central-Business-District-Geofence-Beginning-J/srxy-5nxn/about_data
cbd_geofence <- read.csv("data/MTA_Central_Business_District_Geofence__Beginning_June_2024_20250605.csv", stringsAsFactors = FALSE)
cbd_geofence <- st_as_sfc(cbd_geofence$polygon, crs = 4326)
# Loading NYC Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data
nyc_tracts <- read.csv("data/2020_Census_Tracts_20250606.csv", stringsAsFactors = FALSE) %>%
rename("geometry" = "the_geom")
nyc_tracts <- st_as_sf(nyc_tracts, wkt = "geometry", crs = 4326)
nyc_tracts <- nyc_tracts %>%
mutate(
area_sqm = Shape_Area / 27878400
) %>%
select(
geometry,
BoroCT2020,
BoroCode,
BoroName,
area_sqm
)
# Loading collisions data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data
filter <- c(0, NA)
crashes <- read.csv("data/Motor_Vehicle_Collisions_-_Crashes_20250605.csv") %>%
select(CRASH.DATE,
LATITUDE,
LONGITUDE,
NUMBER.OF.PERSONS.INJURED,
NUMBER.OF.PERSONS.KILLED,
NUMBER.OF.PEDESTRIANS.INJURED,
NUMBER.OF.PEDESTRIANS.KILLED) %>%
filter(! LATITUDE %in% filter,
! LONGITUDE %in% filter) %>%
rename(
"date" = "CRASH.DATE",
"lat" = "LATITUDE",
"long" = "LONGITUDE",
"persons_inj" = "NUMBER.OF.PERSONS.INJURED",
"persons_death" = "NUMBER.OF.PERSONS.KILLED",
"ped_inj" = "NUMBER.OF.PEDESTRIANS.INJURED",
"ped_death" = "NUMBER.OF.PEDESTRIANS.KILLED"
) %>%
mutate(
date = mdy(date)
) %>%
st_as_sf(coords = c("long","lat"), crs = 4326)
# Associating CP Zone, Pre/Post, Treatment, Borough, and Census Tract w/ Observations
crashes <- crashes %>%
mutate(
cp_zone = as.integer(lengths(st_intersects(geometry, cbd_geofence)) > 0),
after_cp = ifelse(date >= cp_initial_date, 1, 0),
treatment = ifelse(cp_zone == 1 & after_cp == 1, 1, 0)
) %>%
st_join(nyc_tracts, join = st_within) %>%
filter(! is.na(area_sqm))
# Interactive Data Table
set.seed(50)
crashes_subset <- crashes %>%
sample_n(10000)
crashes_dt <- crashes_subset %>%
st_drop_geometry()
datatable(crashes_dt,
extensions = 'Buttons',
filter = "top",
colnames = c(
"Date",
"Persons Injured",
"Persons Killed",
"Pedestrians Injured",
"Pedestrians Dead",
"CP Zone",
"Before/After CP",
"Treatment",
"Census Tract",
"Borough Code",
"Borough Name",
"Census Tract Area"
),
rownames = FALSE,
options = list(
autoWidth = TRUE,
scrollX = TRUE
),
class = 'compact',
escape = FALSE
) %>%
formatStyle(
columns = names(crashes_dt),
`white-space` = "nowrap",
`height` = "1.5em",
`line-height` = "1.5em"
)
Next, I created an interactive map of the same sample of 10,000 collisions, along with the congestion pricing zone. This doesn’t tell us a whole lot yet, but you can see that lower Manhattan is somewhat of a hotspot for crashes. We’ll explore that with more maps later on.
# Create an interactive map:
ind_crashes_map <- leaflet(data = crashes_subset) %>%
addTiles() %>%
setView(lng = -73.9, lat = 40.7, zoom = 10) %>%
addMarkers(clusterOptions = markerClusterOptions(
maxClusterRadius = 40,
showCoverageOnHover = TRUE),
popup = ~paste(
"Date:", date, "<br>",
"Persons Injured:", persons_inj, "<br>",
"Persons Killed:", persons_death)
) %>%
addPolygons(data = cbd_geofence,
color = "red",
weight = 1,
fillOpacity = 0.5) %>%
addLegend(
position = "bottomright",
colors = "red",
labels = "Within Zone",
title = "Congestion Pricing Zone",
opacity = 0.8
) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addControl(
html = '<h4 style="text-align: right;">
Crash Incidents in NYC<br>
<span style="display: block; margin-top: 0.4em;">2024-25</span>
</h4>',
position = "topright"
) %>%
addResetMapButton()
ind_crashes_map
With these basic visualizations out of the way, I think it would be interesting to begin to aggregate the observations to the tract level. The BYM model mentioned above relies on an Intrinsic Conditional Auto-Regressive (ICAR) term that models counts in a discrete geographic area using surrounding areas. I’ll get into it more in the next post, but TLDR: we need shapes (census tracts) with a single measure (count of crashes).
# Aggregating/summarizing data to census tract level, no time component
crashes_grouped <- crashes %>%
group_by(BoroCT2020) %>%
summarize(tot_crashes = n(),
area = mean(area_sqm)) %>%
ungroup() %>%
mutate(
crashes_per_area = tot_crashes / area
) %>%
st_drop_geometry()
crashes_tracts_overall <- nyc_tracts %>%
right_join(crashes_grouped,
by = "BoroCT2020")
Our first step in the journey of aggregation is to simply aggregate to the census tract level, irrespective of time. That is, sum up the counts of crashes per census tract for the entire range of the data (in this case, 1/1/2024 to 6/1/2025). From there, my first priority was to establish that crash counts tend to cluster. It follows intuitively that census tracts near to each other are likely to have some of the same characteristics that might impact the number of crashes we expect to see: traffic volume, speed limits, road widths. Beyond intuition, though, I want to demonstrate this clustering as justification for the future spatial autoregressive models that I want to run.
First, a simple map of crash counts per square mile for each tract in the city. I added the “per square mile” bit to handle cases where bigger census tracts are more likely to have more crashes, simply due to their size. In the map below, you can pretty easily see some apparent clustering. Further, this clustering tends to occur in places you might expect: busy, road-heavy, traffic heavy areas like Lower Manhattan, Downtown Brooklyn, and East Harlem/the Bronx. Notably, the Lower Manhattan area is pretty much consistent with the congestion pricing zone. The fact that this seems to be a hotspot is expected, but good to visually confirm.
# Mapping crashes per square mile across the city
choropleth_overall <- tm_shape(crashes_tracts_overall) +
tm_polygons(fill = "crashes_per_area",
fill.scale = tm_scale_intervals(values = "brewer.yl_or_rd",
style = "jenks"),
fill.legend = tm_legend(title = "Crashes per Sq. Mile"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
choropleth_overall
For all the visual clarity of the graph, it’s always a good idea to let statistics back up what you anecdotally observe.
There’s an inferential statistic for clustering: Moran’s I, which measures spatial autocorrelation. In other words, it tests if there is more (or less) correlation between nearby/bordering areas than a random distribution of values would suggest. In statistical inference speak, the null hypothesis is that there is no spatial clustering, and the alternative hypothesis is that areal units near to each other are more correlated. I hope I got all that right.
In more simple speak, if there’s a statistically significant result from the test, that’s a good sign for there being clustering. Given the visual examination above, it seems likely that we get a statistically significant result.
I should note here that, as you will see later on, the distribution of crash counts is positively skewed (i.e., has a long right tail, which makes sense given the lower bound at 0). Apparently, Moran’s I is pretty sensitive to this issue, and skewness is likely to result in more false positives. To remedy this issue, I transformed the crash counts to the square root scale (initially, I tried the log scale, but this resulted in a left skewed distribution and cannot handle counts of 0). This transformation was only for this piece of the analysis and, unless I say otherwise, I’ll be using raw crash counts moving foward.
# Sqrt transform data to enforce normal distribution for Moran's I
crashes_tracts_overall <- crashes_tracts_overall %>%
mutate(
sqrt_crashes_per_area = sqrt(crashes_per_area)
)
# Get weights matrix
# Identify tracts with no neighbors
neighbors <- poly2nb(crashes_tracts_overall, queen = TRUE)
no_neighbors <- which(card(neighbors) == 0)
# Filter the original spatial dataframe to exclude these tracts
if (length(no_neighbors) > 0) {
crashes_tracts_overall_clean <- crashes_tracts_overall[-no_neighbors, ]
} else {
crashes_tracts_overall_clean <- crashes_tracts_overall
}
# Rebuild neighbors and weights on the cleaned spatial data
neighbors_clean <- poly2nb(crashes_tracts_overall_clean, queen = TRUE)
weights_clean <- nb2listw(neighbors_clean, style = "W", zero.policy = TRUE)
# Conducting Moran's I test for global clustering
moran_global <- moran.test(crashes_tracts_overall_clean$sqrt_crashes_per_area,
weights_clean,
zero.policy = TRUE)
# Create a summary data frame
morans_df <- tibble(
Statistic = c("Moran's I", "Expected I", "Variance", "P-value"),
Value = c(
moran_global$estimate["Moran I statistic"],
moran_global$estimate["Expectation"],
moran_global$estimate["Variance"],
moran_global$p.value
)
)
# Print as a pretty table
knitr::kable(morans_df, digits = 5) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
| Statistic | Value |
|---|---|
| Moran’s I | 0.59139 |
| Expected I | -0.00043 |
| Variance | 0.00015 |
| P-value | 0.00000 |
And hey presto! The p-value of < 0.00001 (it’s actually way, way lower than that, but I couldn’t quickly figure out a good way to show it) indicates there’s strong spatial correlation, and there’s less than a 1-in-100,000 chance that this clustering occurred by chance, roughly speaking. I’m really not that concerned with the perfect interpretation of the test results, this was more about proving there to be spatial correlation.
Our next step will be to disaggregate the data into discrete time buckets, because I’m interested not only in modeling the spatial autocorrelation between the tracts, but also the impact of the congestion pricing zone implementation on 1/5/2025.
Now, we’ll zoom in with a bit more granularity on our numbers. Rather than aggregating to a single number across the whole time range of the data, I’ll instead aggregate to monthly counts. In order to better visualize these trends, I’ll drop the census tract granularity for now.
One major reason for the following visualizations is that I’m intending on doing some version of a Difference-in-Difference (DiD) analysis. One of the key assumptions in DiD is “parallel trends.” In essense, this means that pre-treatment, we want the treatment and control groups to be trending the same way. This makes sense–if they’re not trending the same way, then none of the causal inference assumptions that we want to make are going to be satisfied (i.e., the treatment and control groups are not similar enough for a comparison). Now, there are statistical ways of testing this assumption, and I will get to these in the next post, but a visual inspection can give us a reasonable sense of where we lie in relation to the assumption. Further, this is one of those statistical assumptions that isn’t really provable–we just want enough evidence in favor that we say “eh, ok, looks pretty good” and move on.
First, a plot of monthly crashes (per square mile, to give equivalency across boroughs of differing size). The included vertical dashed line indicates when the congestion pricing policy was put into effect. Right off the bat, these look pretty parallel to me. There’s more variation in the boroughs with higher crash per square mile numbers, but that’s intuitive and expected. It should be noted that this isn’t really the parallel trends assumption in action, since Manhattan hasn’t been split out into the CP Zone and non-CP Zone.
# Getting borough sq. miles to scale crash counts
nyc_boro <- nyc_tracts %>%
st_drop_geometry() %>%
group_by(BoroName) %>%
summarize(
area = sum(area_sqm)
)
# Extracting Month + Year for each observation for grouping
crashes <- crashes %>%
mutate(
m_y = as.Date(paste0(format(as.Date(date), "%Y-%m"), "-01"))
)
# Aggregating/summarizing data to borough level, along with month+year
crashes_grouped_monthboro <- crashes %>%
group_by(m_y, BoroName) %>%
summarize(tot_crashes = n(), .groups = "drop") %>%
st_drop_geometry() %>%
mutate(
BoroName = as.factor(BoroName)
) %>%
filter(m_y < "2025-06-01") %>%
left_join(nyc_boro,
by = "BoroName") %>%
mutate(
crashes_per_sqm = tot_crashes / area
)
# Time series plot
scale_boro <- c("Bronx" = "#E69F00",
"Brooklyn" = "#56B4E9",
"Manhattan" = "#009E73",
"Queens" = "#CC79A7",
"Staten Island" = "#0072B2")
ts_boro_plot <- ggplot(data = crashes_grouped_monthboro,
aes(x = m_y,
y = crashes_per_sqm,
group = BoroName)) +
geom_line(aes(color = BoroName)) +
theme_minimal() +
geom_segment(aes(x = as.Date("2025-01-01"), xend = as.Date("2025-01-01"),
y = 0, yend = 60),
color = "black", linetype = "dashed", linewidth = 0.75) +
labs(title = "Monthly Crashes per Square Mile, by Borough",
x = "",
y = "Crashes") +
annotate("text",
x = as.Date("2025-05-01"),
y = 27,
label = "Bronx",
color = "#E69F00",
hjust = 1) +
annotate("text",
x = as.Date("2025-05-01"),
y = 37,
label = "Brooklyn",
color = "#56B4E9",
hjust = 1) +
annotate("text",
x = as.Date("2025-05-01"),
y = 48,
label = "Manhattan",
color = "#009E73",
hjust = 1) +
annotate("text",
x = as.Date("2025-05-01"),
y = 16,
label = "Queens",
color = "#CC79A7",
hjust = 1) +
annotate("text",
x = as.Date("2025-05-01"),
y = 4,
label = "Staten Island",
color = "#0072B2",
hjust = 1) +
annotate("text",
x = as.Date("2025-01-05"),
y = 62,
label = "Congestion Pricing Starts",
color = "black") +
scale_color_manual(name = "Borough",
values = scale_boro) +
theme(legend.position = "none") +
scale_x_date(
limits = as.Date(c('2024-01-01', '2025-05-01')),
breaks = seq(as.Date('2024-01-01'), as.Date('2025-05-01'), by = '2 months'),
date_labels = "%b %Y"
)
ts_boro_plot
Now, we can look at the treatment and control groups. A note before I talk about the graph. In my final modeling, I’ll be doing what any good modeler does and including pre-treatment covariates into the model to adjust for differences between tracts. This might include things like density, measures of road speed limits, etc. Absent that effort, and wanting to just get a quick visual inspection, I simply filtered the dataset to only Manhattan tracts, and compared tracts within the zone and outside the zone. This is hardly ideal, since tract characteristics can, and do, vary within Manhattan.
# Area of CP Zone (and converted from m^2 to square miles)
cp_area <- as.numeric(sum(st_area(cbd_geofence))) / 2589988.11
manhattan_tot_area <- as.numeric(nyc_boro[nyc_boro$BoroName == "Manhattan", "area"])
# Aggregating/summarizing data to CP Zone/Non-CP Zone, along with month+year
crashes_grouped_monthzone <- crashes %>%
filter(BoroName == "Manhattan") %>%
group_by(m_y, cp_zone) %>%
summarize(tot_crashes = n(), .groups = "drop") %>%
st_drop_geometry() %>%
filter(m_y < "2025-06-01") %>%
mutate(
cp_zone_verbose = as.factor(ifelse(cp_zone == 1, "In Zone", "Out of Zone")),
area = ifelse(cp_zone == 1, cp_area, manhattan_tot_area - cp_area),
crashes_per_sqm = tot_crashes / area
)
# Time series plot
scale_zone <- c("In Zone" = "blue",
"Out of Zone" = "red")
ts_zone_plot <- ggplot(data = crashes_grouped_monthzone,
aes(x = m_y,
y = crashes_per_sqm,
group = cp_zone_verbose)) +
geom_line(aes(color = cp_zone_verbose)) +
theme_minimal() +
geom_segment(aes(x = as.Date("2025-01-01"), xend = as.Date("2025-01-01"),
y = 0, yend = 70),
color = "black", linetype = "dashed", linewidth = 0.75) +
labs(title = "Monthly Crashes per Square Mile, in Manhattan, by CP Zone",
x = "",
y = "Crashes") +
annotate("text",
x = as.Date("2025-05-01"),
y = 66,
label = "In Zone",
color = "blue",
hjust = 1) +
annotate("text",
x = as.Date("2025-05-01"),
y = 42,
label = "Out of Zone",
color = "red",
hjust = 1) +
annotate("text",
x = as.Date("2025-01-05"),
y = 72,
label = "Congestion Pricing Starts",
color = "black") +
scale_color_manual(name = "Zone",
values = scale_zone) +
theme(legend.position = "none") +
scale_x_date(
limits = as.Date(c('2024-01-01', '2025-05-01')),
breaks = seq(as.Date('2024-01-01'), as.Date('2025-05-01'), by = '2 months'),
date_labels = "%b %Y"
)
ts_zone_plot
Anyway, I’m reasonably happy with this. Pre-treatment, the trends look pretty parallel, and there’s even some funky things going on post-treatment. We see a big drop in within-zone crash numbers from December 2024 to January 2025, right when policy was put into effect, but the a perhaps disproportionate increase in collisions again after that. Not enough to make any claims, but it does make me interested to model it.
Now, for the two final things I want to get done in this post: have a quick look at crash distributions, and set up a final data table that I can export out of this workbook and into my modeling one.
We can see in the ridgeplot below that monthly crashes per square mile follows some very normal-looking distributions for Staten Island, the Bronx, and Queens. Manhattan and Brooklyn, though, are (a) a bit more bumpy and (b) more spread out. Again, the increased spread makes sense given the differences in magnitudes between, say, Staten Island and Manhattan. The bumpiness is a bit more interesting and I’m not entirely sure what’s going on with it. That said, both Manhattan and Brooklyn seem somewhat non-skewed (ok, Brooklyn is a little skewed), so I think these look ok. I believe my intended future models are going to use the Poisson distribution (and perhaps normal, I still have to revisit this), and these look plausibly Poisson-ish.
crashes_grouped_monthboro$BoroName <- factor(
crashes_grouped_monthboro$BoroName,
levels = c("Staten Island", "Queens", "Bronx", "Brooklyn", "Manhattan")
)
boro_ridgeplot <- ggplot(data = crashes_grouped_monthboro,
aes(x = crashes_per_sqm,
y = BoroName)) +
geom_density_ridges(aes(fill = BoroName),
scale = 0.9,
alpha = 0.7,
bandwidth = 0.8) +
theme_minimal() +
theme(legend.position = "none",
axis.text = element_text(size=10,face="bold")) +
scale_fill_manual(values = scale_boro) +
labs(title = "Monthly Crashes per Square Mile Density, by Borough",
x = "Crashes per Square Mile",
y = "") +
scale_x_continuous(
limits = c(0, NA),
breaks = seq(0, max(crashes_grouped_monthboro$crashes_per_sqm, na.rm = TRUE) + 10, by = 10)
)
boro_ridgeplot
And, finally, below you can see a sample of the final dataset that I can export out of here. One big thing: some tracts span the treatment groups (i.e., lie half-in and half-out of the congestion pricing zone), and one month (January 2025) spans pre-post treatment. I’m gonna do some forking paths here, but transparently. First, I decided that if any of a tract falls within the congestion pricing zone, the whole tract will be counted as treated. Not perfect, but given that we just modeled how crash counts are correlated with nearby tracts, I think it’s reasonable to assume. Second, I decided that all of January, 2025 should be “treated.” Again, not ideal, but there were only 4 non-treated days in that month and the alternative was to either count it as entirely untreated (which is even more unreasonable) or try to lump the first 4 days in with December, 2024, which also seems like it would fudge results to make it look weird.
# Final table to export
crashes_tract_month <- crashes %>%
group_by(m_y, BoroCT2020) %>%
summarize(tot_crashes = n(),
area_sqm = mean(area_sqm),
cp_zone = mean(cp_zone),
after_cp = mean(after_cp),
.groups = "drop") %>%
st_drop_geometry() %>%
filter(m_y < "2025-06-01") %>%
mutate(crashes_per_sqm = tot_crashes / area_sqm,
cp_zone = ifelse(cp_zone == 0, 0, 1),
after_cp = ifelse(after_cp == 0, 0, 1),
treatment = ifelse(cp_zone + after_cp == 2, 1, 0)) %>%
rename(
"time_period" = "m_y",
"census_tract" = "BoroCT2020"
)
# Interactive Datatable
set.seed(50)
crashes_tract_month_sample <- crashes_tract_month %>%
sample_n(10000) %>%
mutate(
area_sqm = round(area_sqm, 3),
crashes_per_sqm = round(crashes_per_sqm, 3)
)
datatable(crashes_tract_month_sample,
extensions = 'Buttons',
filter = "top",
colnames = c(
"Time Period",
"Census Tract",
"Total Crashes",
"Tract Area (Sq. Miles)",
"CP Zone",
"Before/After CP",
"Crashes per Square Mile",
"Treatment"
),
rownames = FALSE,
options = list(
autoWidth = TRUE,
scrollX = TRUE
),
class = 'compact',
escape = FALSE
) %>%
formatStyle(
columns = names(crashes_tract_month_sample),
`white-space` = "nowrap",
`height` = "1.5em",
`line-height` = "1.5em"
)
And that’s it! In the next post, I’ll be test the parallel trends assumption, working on a node-matrix to define how my model will treat the autocorrelations between tracts, and attempting to get Stan to play ball so I can model crashes at both a general level and (hopefully) with that DiD term.